Document Analysis and Retrieval Tasks in Scientific Digital Libraries

نویسندگان

  • Sujatha Das Gollapalli
  • Cornelia Caragea
  • Xiaoli Li
  • C. Lee Giles
چکیده

Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive tasks related to documents such as categorization and entity extraction. Consequently, the application of machine learning techniques for various large-scale IR tasks has gathered significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeerx, which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address these challenges.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Interactive Search Elements in Digital Libraries

Background and Aim: Interaction in a digital library help users locating and accessing information and also assist them in creating knowledge, better perception, problem solving and recognition of dimension of resources. This paper tries to identify and introduce the components and elements that are used in interaction between user and system in search and retrieval of information in digital li...

متن کامل

An Empirical Evaluation of the Interactive Visualization of Metadata to Support Document Use

Digital Libraries currently focus on delivering documents. Since information needs are often satisfied at the sub-document level, digital libraries should explore ways to support document use as well as retrieval. This paper describes the design and initial evaluation of a technology being developed for document use. It uses interactive visualization of paragraph level metadata to allow rapid g...

متن کامل

Tools for Document Image Retrieval in Digital Libraries: the AIDI System

In the last few years, Digital Libraries became one important application area for Document Image Analysis and Recognition research [1]. In this field, a relevant line of research is Document Image Retrieval (DIR) that aims at finding relevant documents relying on image features only. DIR techniques are used to index not only the textual content of a document, but also its layout, graphical obj...

متن کامل

Digital Libraries and Document Image Analysis Techniques: a Survey

Nowadays, Digital Libraries have become a widely used service to store and share both digital born documents and digital versions of works stored by traditional libraries. Document images are intrinsically non-structured and the structure and semantic of the digitized documents is in most part lost during the conversion. Several techniques related to the Document Image Analysis research area ha...

متن کامل

New Tasks on Collections of Digitized Books

Motivated by the plethora of book digitization projects around the world, the Initiative for the Evaluation of XML Retrieval (INEX) launched a Book Search track in 2007. The track focused on Information Retrieval (IR) tasks, exploring the utility of traditional and structured document retrieval techniques to books. In this paper, we propose four new tasks to be investigated at the Book Search t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014